{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Quadratic Optimization\n",
    "## Introduction\n",
    "Experience will soon tell you that amid the wide range of solvers that support NLP, most provide support specifically for quadratic functions. This is because there is a wide range of problems that deal specifically with quadratic functions, rather than with generic non-linear functions. Recall that a quadratic function is a polynomial function of the decision variables with terms of at most second order degree.\n",
    "For instance, an outstanding example of quadratic function is the quadratic error function: \n",
    "\n",
    "$Loss(z) = \\frac{z^2}{2}$\n",
    "\n",
    "Any optimization problem that aims to minimize a given error can use this empirical error function as the objective function. The next subsections present some examples of quadratic optimization problems that use exclusively this type of functions. \n",
    "\n",
    "## Machine Learning Linear Regression as a Quadratic Unconstrained NLP\n",
    "### Supervised Learning Regression Problem Set-Up\n",
    "The objective of a linear regression problem is to *explain* a response variable, or dependent variable, based on the \n",
    "values of a set of explanatory variables, or independent variables, assuming that the relationship between the \n",
    "response variable and the explanatory variables is linear. \n",
    "\n",
    "That is, given: \n",
    "\n",
    "- $x^{(t)}$: feature vector t. Sample vector t of explanatory variables (e.g. different characteristics of people like gender, age, region, device, …). $x(t)$ is a d-dimensional vector (i.e. $x^{(t)} \\in \\mathbb{R^d}$)\n",
    "- $y^{(t)}$: Label t. Response variable t that we want to explain (e.g. for instance the probability that a person is a potential customer or not). Let us assume that every $y^{(t)}$ is a real value (i.e. $y^{(t)} \\in \\mathbb{R}$)\n",
    "- $S_n$: Training set containing all labeled feature vectors, i.e. $S_n = \\{x^{(t)}, y^{(t)} | t=1,...,n \\}$\n",
    "\n",
    "The objective is to learn the linear mapping between the feature vector space x and the response variable y: \n",
    "\n",
    "$\\hat y = f(x, \\theta, \\theta_0) = \\sum_{i=1}^{d}{x*\\theta} + \\theta_0 = x·\\theta + \\theta_0$\n",
    "\n",
    "So that the value of y can be estimated for any point in the vector space x based on the training data set. The parameters to learn are:\n",
    "\n",
    "- $\\theta$: Linear regression function coefficients\n",
    "- $\\theta_0$: Linear regression independent term (Let us assume it is zero for now, without loss of generality)\n",
    "\n",
    "In general, the objective is to minimise the **empirical risk**, noted as $R_n$, which is the average of the estimation error at every training point t. Let as for now note $\\text{Loss}$ the loss function that represents the error at every training point. Then, the objective function is:\n",
    "\n",
    "$\\min R_n(\\theta) = \\frac{1}{n}\\sum_{t=1}^{n}{\\text{Loss}( y^{(t)} - x^{(t)}·\\theta)}$\n",
    "    \n",
    "Now, if we consider the quadratic error function, the empirical risk becomes: \n",
    "\n",
    "$\\min R_n(\\theta) = \\frac{1}{n}\\sum_{t=1}^{n}{\\frac{( y^{(t)} - x^{(t)}·\\theta)^2}{2}}$\n",
    "\n",
    "This minimisation problem is a quadratic optimization problem where the decision variables are the components of $\\theta$.\n",
    "\n",
    "Thus, we can compute the closed form of the gradient as: \n",
    "\n",
    "$\\nabla_{\\theta} R_n(\\theta) = \\frac{1}{n}\\sum_{t=1}^{n}{\\nabla_{\\theta}\\frac{( y^{(t)} - x^{(t)}·\\theta)^2}{2}} = \\frac{1}{n}\\sum_{t=1}^{n}{\\left [ (y^{(t)}-x^{(t)}·\\theta)*(-x^{(t)})\\right ]} = -\\frac{1}{n}\\sum_{t=1}^{n}{y^{(t)}*x^{(t)}}-\\frac{1}{n}\\sum_{t=1}^{n}{(x^{(t)}·\\theta)*-x^{(t)}}$\n",
    "\n",
    "Now, note that the first term is the sum-product of scalars ($y^{(t)}$) and vectors of size d ($x^{(t)}$) and therefore, it is a vector. Let us note this first term as b:\n",
    "\n",
    "$b = \\frac{1}{n}\\sum_{t=1}^{n}{y^{(t)}*x^{(t)}}$\n",
    "\n",
    "As for the second term, note that is the sum of the scalar product of two vectors of size d (which is a scalar) times another vector of size d. We can apply the following transformation: \n",
    "\n",
    "$-\\frac{1}{n}\\sum_{t=1}^{n}{(x^{(t)}·\\theta)*-x^{(t)}} = \\frac{1}{n}\\sum_{t=1}^{n}{x^{(t)}*(x^{(t)})^T}*\\theta = A*\\theta$  \n",
    "\n",
    "That is, first we use the vector product properties to place together the two $x^{(t)}$ and then we move $\\theta$ outside the summation since it does not depend on t.\n",
    "\n",
    "With this transformation, the closed form of the optimisation problem now becomes: \n",
    "\n",
    "\n",
    "$\\nabla_{\\theta} R_n(\\theta) = -b + A\\theta = 0 \\rightarrow A\\theta = b$\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "source": [],
    "metadata": {
     "collapsed": false
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}